1. INTRODUCTION

The number of people with physical impairments is increasing every year due to ageing, accidents, and
conditions such as paralysis, spinal cord injuries (SCI), amputation, and quadriplegia [1]. These people
require a special wheelchair, such as a motorised wheelchair, to move around. A motorised wheelchair consists
of an electric motor and a joystick mounted on the armrest. However, people with a paralysed or otherwise
impaired hand cannot operate the joystick. Therefore, many studies have investigated voice recognition as a
way to help impaired people control the wheelchair [1-3]. Such a wheelchair can be called a smart or
intelligent wheelchair because it can be moved using only the voice. The first intelligent wheelchair was
developed in the SIAMO project at the University of Alcala, Spain, in 1999 [4].

The development of voice and speech identification technology was started by Texas Instruments in
1960 [5]. Recently, voice recognition, or speech recognition, has been applied to assist people working
with digital devices such as mobile phones, tablets, and personal computers. There are also voice
recognition software and web pages, e.g. Google applications, translation software, and personal assistants
such as Alexa, Cortana, and Siri. These personal assistants can be called modern chatbots, which can assist
people with any topic they would like to ask about.

Advances in artificial intelligence (AI) have produced smart, or intelligent, wheelchairs that exploit
voice recognition. Much research has been conducted to improve the functionality of the wheelchair.
Nasrin [1] proposed a voice-controlled smart wheelchair application with an added GPS function to track the
user's navigation and location. This application requires a Wi-Fi connection and is installed in a mobile
environment. In contrast, Avutu et al. [3] proposed a low-cost map for a voice recognition wheelchair that
can be applied in a local area. This application can be used without any network connection, and its cost
is much lower than that of an application that relies on a network connection. Barriuso [6] proposed a
smartphone application for a wheelchair with an agent-based intelligent interface. Another application was
developed using an Arduino Mega 2560 [7]. Its special feature is that emergency messages can be sent to
important contacts. With an ultrasonic sensor, obstacles can be detected and avoided without an internet
connection. An intelligent wheelchair application for people with paraplegia was proposed in [8]; it
applies voice recognition and touch screen control. Meanwhile, Chauhan [9] conducted a comparative study on
voice recognition wheelchairs and proposed a new design deploying an infrared sensor and a Raspberry Pi
with a USB microphone.

The development of smart or intelligent wheelchairs is mostly based on AI technologies. A voice
recognition application begins with feature extraction, followed by classification. There are many
feature extraction techniques for voice recognition, such as Linear Prediction Coefficients (LPC),
Discrete Wavelet Transform (DWT), Line Spectral Frequencies (LSF), and Mel-Frequency Cepstral
Coefficients (MFCC) [10].

Classification techniques can be applied to classify and recognise the voice, especially in
wheelchair applications. Among the most popular and promising classification techniques, with high
accuracy, are convolutional neural networks (CNNs). CNNs are based on the feedforward neural network and
generalise better than networks with full connectivity between adjacent layers [11]. Moreover, CNNs have
been successfully applied in many applications, especially image identification [12-15], surveillance [16],
and human recognition [17-19]. Lei and She [20] authenticated voice using CNNs in a noisy environment;
their approach produced better accuracy and reduced the equal error rate. Guan [21] optimised speech
recognition performance using CNNs, and the recognition performance increased, with an error rate of
13.88%. Sharifuddin et al. compared CNNs and BPNNs on voice-controlled wheelchair applications and
obtained high accuracy with CNNs [5].

The Support Vector Machine (SVM) can be categorised as one of the best classification techniques.
SVM has also been applied in various applications, for instance biometrics [17, 22-23], sentiment analysis
[24-25], and security tasks such as intrusion detection [26-27]. Selvakumari and Radha [28] applied SVM to
classify speech pathology and achieved 98% accuracy in a comparison with the Naive Bayes algorithm.
Furthermore, Astuti and Riyandwita [29] applied SVM to recognise voice commands for starting a car engine;
SVM classified the words 'on' and 'off' for the car engine with 92.15% accuracy. Meanwhile, Harvianto et
al. [30] analysed voice in the Indonesian language using MFCC and SVM and achieved a high accuracy of 91.83%.

In this paper, we propose an intelligent wheelchair using voice recognition. Four voice commands are
to be recognised: go, left, right, and stop. The data is collected from Google, and some of it is
self-recorded. Features are extracted from the data using the MFCC technique, followed by recognition
using CNNs. We also classify the voice using the SVM method to compare the efficiency and accuracy of the
two techniques. This paper is organised as follows. Section 2 presents the research methods on voice
recognition, CNNs, and SVM. Experimentation details and results, with comparisons between CNNs and SVM,
are presented in Section 3. Section 4 concludes the paper.


2. RESEARCH METHOD 

This paper focuses on five characteristics of the voice data. The speech signal is based on single
words; speakers are independent; the language is English; the vocabulary is small; and background noise is
present. Table 1 presents these five characteristics.


Table 1. Types of collected voice data 


Type              Description
Speech signal     Single words
Speaker           Individual
Language          English
Vocabulary        Small
Background noise  Yes


A development board contains a microprocessor and a microcontroller on a small, compact circuit board.
The development board receives the input, processes the command, and produces the output. Several
development boards are on the market, each with its own strengths and limitations, for example Arduino,
Banana Pi, and Raspberry Pi. The Raspberry Pi 3B+ is used in this project for the following reasons: fast
processing time, low cost, and ease of programming.

In voice recognition, CNNs are seen as a technique capable of providing high accuracy, especially in
speech and image recognition problems [20]. CNNs are good at highlighting important points as well as
filtering noise in the dataset. CNNs consist of a feature extraction layer and a classification layer.


2.1. Feature extraction layer 

The feature extraction layer contains convolutional layers and pooling layers. In this layer, features
are extracted from the data before being fed into the classification layer. We implement three 2D
convolutional layers and three 2D max pooling layers in this paper.


2.2. Classification layer 

The classification layer aims to classify the data received from the feature extraction layer. In this
paper, the classification layer consists of an input layer, a hidden layer, and an output layer.

In addition, this paper implements SVM to classify the extracted voice data. SVM is chosen because of
its ability to perform multi-class classification in high-dimensional spaces. Three kernel functions are
used in the experiment, i.e. polynomial, RBF, and sigmoid. We then conduct further experiments to measure
the effect of C, gamma, and degree using the kernel selected from these results.

Figure 1 presents the flow of the proposed system [5]. We describe the flow in terms of three
modules: the input (the user, i.e. the disabled person), the process, and the output. First, the user
pushes a button to record the voice command using the portable microphone. MFCC and CNNs are used on the
Raspberry Pi 3B+ to process the recorded voice. Once the system has recognised the voice command, the
output signal is sent from the Raspberry Pi 3B+ to the motor driver. The motor driver controls the
movement of the motors to direct the robot car.

Figure 1. Flow of the system
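
The last stage of this flow, turning the classifier output into a motor command, can be sketched as
follows. This is a minimal illustration and not the authors' code: the class order in `CLASSES`, the
function name `decide_command`, and the choice to ignore the two noise classes (described in Section 3.1)
are assumptions made for the example.

```python
# Hypothetical class order; the paper does not state how the six outputs are indexed.
CLASSES = ["go", "stop", "left", "right", "urban_noise", "white_noise"]
MOTOR_COMMANDS = {"go", "stop", "left", "right"}


def decide_command(scores):
    """Return the motor command for one vector of six class scores.

    Returns None when the top-scoring class is a noise class, so that the
    motor state is left unchanged (an assumed design choice).
    """
    if len(scores) != len(CLASSES):
        raise ValueError("expected one score per class")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    label = CLASSES[best_index]
    return label if label in MOTOR_COMMANDS else None


if __name__ == "__main__":
    print(decide_command([0.10, 0.05, 0.80, 0.02, 0.02, 0.01]))
    print(decide_command([0.10, 0.10, 0.10, 0.10, 0.50, 0.10]))
```

Returning `None` for noise keeps background sound from triggering movement; a real system might instead
fall back to the stop command.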


3. EXPERIMENTATIONS AND RESULTS 
3.1. Data Sample 

Four voice commands are used in the experiment: right, left, stop, and go. The command data was
gathered from Google. There are 2,372 samples for the go command, 2,380 for the stop command, 2,353 for the
left command, and 2,367 for the right command. Urban noise and white noise data are also collected for the
experiment: 2,373 urban noise samples are collected from Google, while 2,300 white noise samples are
self-recorded [5] because such data is not available online. Each voice sample is one second long and is
saved in WAV format. The total number of downloaded and recorded samples used in the experiment is
presented in Table 2.

Table 2. Total number of collected voice data


Types        Data
Go           2,372
Stop         2,380
Left         2,353
Right        2,367
Urban Noise  2,373
White Noise  2,300
Total        14,145


3.2. Mel frequency cepstral coefficients (MFCC) 

MFCC is the technique used to extract features from the audio signal. MFCC uses logarithmically
spaced frequency bands, which allows better speech processing [31]. The signal first goes through feature
extraction and later feature matching. MFCC processing is applied using the Librosa library. There are six
steps involved in MFCC feature extraction, i.e. pre-emphasis filtering, framing the speech sample,
windowing, fast Fourier transform (FFT), mel filter bank, and log energy with discrete cosine transform
(DCT).
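
The first three of these steps can be sketched in plain Python as follows. This is a textbook-style
illustration and not Librosa's implementation or the authors' code: the pre-emphasis coefficient 0.97, the
Hamming window, and the function names are assumptions chosen for the example.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]


def frame_signal(signal, frame_length, hop_length):
    """Split the signal into overlapping frames, dropping any incomplete tail."""
    if len(signal) < frame_length:
        return []
    count = 1 + (len(signal) - frame_length) // hop_length
    return [signal[i * hop_length: i * hop_length + frame_length] for i in range(count)]


def hamming_window(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2 * pi * n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]


def apply_window(frame, window):
    """Multiply each sample by the matching window weight."""
    return [sample * weight for sample, weight in zip(frame, window)]


if __name__ == "__main__":
    samples = [math.sin(0.3 * n) for n in range(64)]   # toy signal, not real speech
    frames = frame_signal(pre_emphasis(samples), frame_length=16, hop_length=8)
    windowed = [apply_window(frame, hamming_window(16)) for frame in frames]
    print(len(windowed), "frames of", len(windowed[0]), "samples")
```

Each windowed frame would then go through the FFT, the mel filter bank, and the log and DCT steps to give
one row of coefficients.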


3.3. Data preparation 

This research uses 2-dimensional (2D) CNNs. The data is taken directly from the raw MFCC output
because it is already in 2D format and does not need to be compressed. Every sample must have the same
shape, which is (44, 20). This shape is set based on the one-second length of the data. If a sample exceeds
this size, the excess data is cut off. However, if it is smaller than (44, 20), it is padded with zeros.
Finally, the data is ready to be fed into the 2D CNNs. Figure 2 illustrates the data preparation process
for the CNNs [5].


[Figure 2 flowchart. Recoverable blocks: extract the signal using MFCC (output shape: (44,20)); data
normalization (0-1); pad with zeros if the output shape is less than (44,20) and cut off if it is more
than that; feed the data into the 2D CNNs.]

Figure 2. CNNs data preparation
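
The padding and cutting rule described above can be sketched as follows, using nested lists in place of
an MFCC array. This is a minimal illustration under stated assumptions: min-max scaling is assumed for the
0-1 normalization, the function names are hypothetical, and `expected_frames` assumes centred framing with
Librosa's default sampling rate (22,050 Hz) and hop length (512), which the paper does not state.

```python
TARGET_FRAMES = 44   # rows, from the (44, 20) shape in the text
TARGET_COEFFS = 20   # columns (MFCC coefficients per frame)


def expected_frames(num_samples, hop_length):
    """Frame count for centred framing: 1 + num_samples // hop_length."""
    return 1 + num_samples // hop_length


def min_max_normalize(matrix):
    """Scale all values to the range 0-1 (assumed normalization method)."""
    values = [v for row in matrix for v in row]
    low, high = min(values), max(values)
    if high == low:
        return [[0.0 for _ in row] for row in matrix]
    span = high - low
    return [[(v - low) / span for v in row] for row in matrix]


def fix_length(matrix, target_frames=TARGET_FRAMES):
    """Cut off extra frames, or append all-zero frames up to the target.

    Assumes the matrix holds at least one frame.
    """
    width = len(matrix[0])
    trimmed = [list(row) for row in matrix[:target_frames]]
    missing = target_frames - len(trimmed)
    return trimmed + [[0.0] * width for _ in range(missing)]


if __name__ == "__main__":
    print(expected_frames(22050, 512))                 # frames in a one-second clip
    short_clip = [[float(i + j) for j in range(TARGET_COEFFS)] for i in range(30)]
    prepared = fix_length(min_max_normalize(short_clip))
    print(len(prepared), len(prepared[0]))
```

Under those assumed defaults a one-second clip gives 44 frames, which is consistent with the first
dimension of the (44, 20) shape.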


3.4. SVM parameter tuning 

Support Vector Machine (SVM) is the other technique used to classify the voice data. To fine-tune the
SVM, fourteen experiments were conducted. The parameters tuned in these experiments are the kernel type,
the regularization parameter C, gamma, and degree. We first conducted three experiments using three
different kernel functions, i.e. polynomial, RBF, and sigmoid. The results showed that the polynomial
kernel achieved the highest accuracy. Next, we conducted further experiments to measure the effect of C,
followed by the effect of gamma and the effect of degree. The best accuracy achieved is 72.39%, with the
polynomial kernel, C = 10, gamma = scale, and degree = 1, as shown in Table 3.


Table 3. SVM best model 
Kernel  C   Gamma  Degree  Accuracy
Poly    10  Scale  1       72.39%
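
To make the roles of gamma and degree concrete, the three kernel functions can be written out directly.
The sketch below is illustrative only. It assumes the kernel definitions and the meaning of gamma = scale
used by scikit-learn's SVC (the paper does not name its SVM library), and the `coef0` offset, which
defaults to 0 here, is an assumption because the paper does not report it.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, gamma, degree, coef0=0.0):
    """K(x, y) = (gamma * <x, y> + coef0) ** degree."""
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma, coef0=0.0):
    """K(x, y) = tanh(gamma * <x, y> + coef0)."""
    return math.tanh(gamma * dot(x, y) + coef0)


def scale_gamma(samples):
    """gamma = 1 / (n_features * variance of all feature values)."""
    values = [v for row in samples for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(samples[0]) * variance)


if __name__ == "__main__":
    data = [[0.0, 2.0], [2.0, 0.0]]          # toy feature vectors, not MFCC data
    gamma = scale_gamma(data)
    print(gamma)
    print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], gamma, degree=1))
```

With degree = 1 and a zero offset, the polynomial kernel reduces to a scaled inner product, so under
these assumptions the best model in Table 3 behaves like a linear SVM.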


3.5. CNNs parameter tuning 

Twelve experiments were conducted to tune the CNNs parameters. From these experiments, we describe
the best model in Table 4 [6]. Table 4 lists the layers, kernel size, stride, filters, padding, nodes,
bias, and activation of the CNNs model. The implemented model achieved 95.30% accuracy in 4.1672 minutes.

Table 4. The best model of CNNs


Layers                  Kernel Size  Stride  Filters  Padding  Nodes  Bias   Activation
Input (shape 20,44,1)   -            -       -        -        -      -      -
2D Convolutional        2            1       16       valid    -      False  Relu
2D Max Pooling          2            2       -        valid    -      -      -
2D Convolutional        2            1       32       valid    -      False  Relu
2D Max Pooling          2            2       -        valid    -      -      -
2D Convolutional        2            1       64       valid    -      False  Relu
2D Max Pooling          2            2       -        valid    -      -      -
Flatten                 -            -       -        -        -      -      -
Output                  -            -       -        -        6      False  Softplus

Optimizer      Adam
Loss Function  mean_squared_error
Epochs         50
Batch Size     10

3.6. Comparison between SVM and CNNs

Based on the experiments, CNNs achieved higher accuracy, 95.30%, compared to 72.39% for SVM. In terms
of time, SVM took a shorter processing time, only 8.21 seconds, whereas CNNs took 250.03 seconds to run.
The capability of the CNNs feature extraction layer to filter noise in the voice data produces higher
accuracy compared to SVM. On the other hand, the lower complexity of the SVM implementation gives a
shorter processing time. Table 5 displays the accuracy and time results for CNNs and SVM.


Table 5. Comparison of results


Model  Accuracy  Time (s)
SVM    72.39%    8.21
CNNs   95.30%    250.03


3.7. Hardware tuning

To demonstrate the implementation of the voice recognition process, we integrate logic gates in the
LN298N motor driver. The logic gates are described in Table 6. They control the movement of the motors
according to four commands, i.e. go, stop, left, and right [6].


Table 6. Implementation of motor driver 


Motor Driver Pin LN298N    Motor Direction
In1  In2  In3  In4
0    0    0    0           Stop
0    1    0    1           Go
1    0    0    1           Left
0    1    1    0           Right
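
The mapping in Table 6 can be expressed as a lookup table in code. The sketch below is a minimal
illustration, not the authors' firmware: `write_pin` stands in for whatever GPIO call sets a pin level
and is a hypothetical parameter, and falling back to the stop levels for an unknown command is an assumed
safety choice.

```python
PIN_NAMES = ("In1", "In2", "In3", "In4")

# (In1, In2, In3, In4) logic levels for each command, copied from Table 6.
PIN_STATES = {
    "stop": (0, 0, 0, 0),
    "go": (0, 1, 0, 1),
    "left": (1, 0, 0, 1),
    "right": (0, 1, 1, 0),
}


def pin_levels(command):
    """Return the four input levels for a command; unknown commands map to stop."""
    return PIN_STATES.get(command, PIN_STATES["stop"])


def drive(command, write_pin):
    """Apply the levels through a caller-supplied write_pin(name, level) function."""
    for name, level in zip(PIN_NAMES, pin_levels(command)):
        write_pin(name, level)


if __name__ == "__main__":
    # Print the pin writes instead of touching real hardware.
    drive("left", lambda name, level: print(name, level))
```

Passing the pin writer as a function keeps the table testable on a desktop machine and lets the same
code run on the board once a real GPIO function is supplied.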


4. CONCLUSION 

In this paper, we proposed an intelligent wheelchair using voice recognition. Four voice commands
are recognised: go, left, right, and stop. The data is collected from Google, and some of it is
self-recorded. Features are extracted from the data using the MFCC technique, followed by recognition
using CNNs and SVM. CNNs produced higher accuracy, i.e. 95.30%, compared to SVM, which achieved only
72.39%. On the other hand, SVM took only 8.21 seconds, while CNNs took 250.03 seconds to execute. CNNs
therefore produce better results because noise is filtered in the feature extraction layer before the
data is classified in the classification layer. However, CNNs take longer because of the complexity of
the networks, whereas the lower complexity of the SVM implementation gives a shorter processing time.